Quinn
Author

Quinn He

Published

November 21, 2022

Code
library(tidyverse)
library(sf)
library(mapview)
#library(summarytools)
library(GGally)
library(stargazer)

knitr::opts_chunk$set(echo = TRUE)

Notes from class

Why did I choose particular interaction terms?

Why did I choose the variables I did for certain models? There must be a reason for each different model

Data Read-in

Here is the Miami housing dataset I am using for the project.

Code
miami_housing <- read_csv("~/Documents/DACSS Program/data/miami-housing.csv")
Rows: 13932 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (17): LATITUDE, LONGITUDE, PARCELNO, SALE_PRC, LND_SQFOOT, TOT_LVG_AREA,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Below I read in data I may end up using that relates to the county and state election data. For now, my main data set is miami_housing.

Code
county_election <- read_csv("~/Documents/DACSS Program/data/president_county.csv")
Rows: 4633 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (3): current_votes, total_votes, percent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
state_election <- read_csv("~/Documents/DACSS Program/data/president_state.csv")
Rows: 52 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (1): total_votes

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
president_county <- read_csv("~/Documents/DACSS Program/data/president_county_candidate.csv")
Rows: 32177 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): state, county, candidate, party
dbl (1): total_votes
lgl (1): won

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Information about data set

The dataset contains information on 13,932 single-family homes sold in Miami .

Names before column rename: PARCELNO: unique identifier for each property. About 1% appear multiple times. SALE_PRC: sale price (\() LND_SQFOOT: land area (square feet) TOTLVGAREA: floor area (square feet) SPECFEATVAL: value of special features (e.g., swimming pools) (\)) RAIL_DIST: distance to the nearest rail line (an indicator of noise) (feet) OCEAN_DIST: distance to the ocean (feet) WATER_DIST: distance to the nearest body of water (feet) CNTR_DIST: distance to the Miami central business district (feet) SUBCNTR_DI: distance to the nearest subcenter (feet) HWY_DIST: distance to the nearest highway (an indicator of noise) (feet) age: age of the structure avno60plus: dummy variable for airplane noise exceeding an acceptable level structure_quality: quality of the structure month_sold: sale month in 2016 (1 = jan) LATITUDE LONGITUDE

I first want to rename some columns because I am not a fan of the format of the stock column names,

Code
miami_housing <- miami_housing %>% 
  rename("latitude" = "LATITUDE",
         "longitude" = "LONGITUDE",
         "sale_price" = "SALE_PRC",  
         "land_sqfoot" = "LND_SQFOOT",  
         "floor_sqfoot" = "TOT_LVG_AREA",
         "special_features" = "SPEC_FEAT_VAL",  
         "dist_2_rail" = "RAIL_DIST",  
         "dist_2_ocean" = "OCEAN_DIST", 
         "dist_2_nearest_water" = "WATER_DIST",  
         "dist_2_biz_center" = "CNTR_DIST",  
         "dis_2_nearest_subcenter"= "SUBCNTR_DI", 
         "dist_2_hiway" = "HWY_DIST", #close distance is a negative trait 
         "home_age" = "age") 

I wonder if it would be worth creating another column called “county” based off of the longitude and latitude coordinates. This would make some graphs more interesting since I could fill by county in a ggplot graph.

Research Question

Does the saying “location, location, location” really matter when it comes to housing prices in Miami or are there other factors as well that contribute to price?

Hypothesis

Outcome variable: Sale price of houses in Miami

Explanatory variable: Distance from the ocean, measured in feet

Hypothesis: Houses closer (lower dist_2_ocean) will have a higher price than houses farther.

Descriptive Statistics

I’ll have to go further in depth into this research that I found.

According to Redfin, the median sale price for houses in Miami is $530,000. https://www.redfin.com/city/11458/FL/Miami/housing-market

https://www.proquest.com/docview/222418064?pq-origsite=gscholar&fromopenview=true - This study found that increase in house prices from 2003-2004 was largely due to interest rates, housing construction, unemployment, and household income.

https://link.springer.com/article/10.1023/A:1007751112669 - uses OLS to predict house prices - The problem with the above procedure has to do with neighborhood effects. Every realtor knows that location is an extremely important determinant of the price of a house, and yet incorporating neighborhood effects into an ordinary regression model is problematic, for two reasons: (1) neighborhood quality is an unobservable variable, and (2) the measurement of neighborhood quality ( presuming that it can be measured) requires the knowledge of neighborhood boundaries. The ®rst issue can be resolved by using proxies for neighborhood quality. Variables such as crime rates, school quality measures (such as scores on standardized tests),3 and socioeconomic characteristics (such as race and income) have all been used as proxies for neighborhood quality. - The data used in the analysis come from 1978 multiple listings for Baltimore, Maryland. - The original data set consists of 2157 house sales. Of these, 664 records contained missing values, leaving 1493 valid observations. The valid records were randomly divided into two groupsÐan estimation sample and a prediction sample. The estimation sample contains 1000 observations, and the prediction sample contains 493 observations. -

https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-4932.2005.00243.x?casa_token=Lk-oinHQfCoAAAAA:3YFA9GRG6r9UwIJa8-8Z40x77ENRHm07d9CQ7dLdRZD5VoGjSVB4hUUcKU8yWiJapg0csSONiwxZzqWJ

https://www.sciencedirect.com/science/article/pii/S1051137712000228?casa_token=DwJ93qbzqLAAAAAA:8vClfNoxs09D_UL4BCg-Ds36vurfjk8t0uK6Hft50ytFeWo3-XaFlsj5r0WBW4lB1jGqgoaqwXA

https://link.springer.com/article/10.1007/s11146-007-9053-7

https://eprints.gla.ac.uk/221903/1/221903.pdf

Exploratory Analysis

Code
str(miami_housing)
spec_tbl_df [13,932 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ latitude               : num [1:13932] 25.9 25.9 25.9 25.9 25.9 ...
 $ longitude              : num [1:13932] -80.2 -80.2 -80.2 -80.2 -80.2 ...
 $ PARCELNO               : num [1:13932] 6.22e+11 6.22e+11 6.22e+11 6.22e+11 6.22e+11 ...
 $ sale_price             : num [1:13932] 440000 349000 800000 988000 755000 630000 1020000 850000 250000 1220000 ...
 $ land_sqfoot            : num [1:13932] 9375 9375 9375 12450 12800 ...
 $ floor_sqfoot           : num [1:13932] 1753 1715 2276 2058 1684 ...
 $ special_features       : num [1:13932] 0 0 49206 10033 16681 ...
 $ dist_2_rail            : num [1:13932] 2816 4359 4413 4585 4063 ...
 $ dist_2_ocean           : num [1:13932] 12811 10648 10574 10156 10837 ...
 $ dist_2_nearest_water   : num [1:13932] 348 338 297 0 327 ...
 $ dist_2_biz_center      : num [1:13932] 42815 43505 43530 43798 43600 ...
 $ dis_2_nearest_subcenter: num [1:13932] 37742 37340 37329 37423 37551 ...
 $ dist_2_hiway           : num [1:13932] 15955 18125 18200 18514 17903 ...
 $ home_age               : num [1:13932] 67 63 61 63 42 41 63 21 56 63 ...
 $ avno60plus             : num [1:13932] 0 0 0 0 0 0 0 0 0 0 ...
 $ month_sold             : num [1:13932] 8 9 2 9 7 2 2 9 3 11 ...
 $ structure_quality      : num [1:13932] 4 4 4 4 4 4 5 4 4 5 ...
 - attr(*, "spec")=
  .. cols(
  ..   LATITUDE = col_double(),
  ..   LONGITUDE = col_double(),
  ..   PARCELNO = col_double(),
  ..   SALE_PRC = col_double(),
  ..   LND_SQFOOT = col_double(),
  ..   TOT_LVG_AREA = col_double(),
  ..   SPEC_FEAT_VAL = col_double(),
  ..   RAIL_DIST = col_double(),
  ..   OCEAN_DIST = col_double(),
  ..   WATER_DIST = col_double(),
  ..   CNTR_DIST = col_double(),
  ..   SUBCNTR_DI = col_double(),
  ..   HWY_DIST = col_double(),
  ..   age = col_double(),
  ..   avno60plus = col_double(),
  ..   month_sold = col_double(),
  ..   structure_quality = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
summary(miami_housing)
    latitude       longitude         PARCELNO           sale_price     
 Min.   :25.43   Min.   :-80.54   Min.   :1.020e+11   Min.   :  72000  
 1st Qu.:25.62   1st Qu.:-80.40   1st Qu.:1.079e+12   1st Qu.: 235000  
 Median :25.73   Median :-80.34   Median :3.040e+12   Median : 310000  
 Mean   :25.73   Mean   :-80.33   Mean   :2.356e+12   Mean   : 399942  
 3rd Qu.:25.85   3rd Qu.:-80.26   3rd Qu.:3.060e+12   3rd Qu.: 428000  
 Max.   :25.97   Max.   :-80.12   Max.   :3.660e+12   Max.   :2650000  
  land_sqfoot     floor_sqfoot  special_features  dist_2_rail     
 Min.   : 1248   Min.   : 854   Min.   :     0   Min.   :   10.5  
 1st Qu.: 5400   1st Qu.:1470   1st Qu.:   810   1st Qu.: 3299.4  
 Median : 7500   Median :1878   Median :  2766   Median : 7106.3  
 Mean   : 8621   Mean   :2058   Mean   :  9562   Mean   : 8348.5  
 3rd Qu.: 9126   3rd Qu.:2471   3rd Qu.: 12352   3rd Qu.:12102.6  
 Max.   :57064   Max.   :6287   Max.   :175020   Max.   :29621.5  
  dist_2_ocean     dist_2_nearest_water dist_2_biz_center
 Min.   :  236.1   Min.   :    0        Min.   :  3826   
 1st Qu.:18079.3   1st Qu.: 2676        1st Qu.: 42823   
 Median :28541.8   Median : 6923        Median : 65852   
 Mean   :31691.0   Mean   :11960        Mean   : 68490   
 3rd Qu.:44310.7   3rd Qu.:19200        3rd Qu.: 89358   
 Max.   :75744.9   Max.   :50400        Max.   :159976   
 dis_2_nearest_subcenter  dist_2_hiway        home_age       avno60plus     
 Min.   :  1463          Min.   :   90.2   Min.   : 0.00   Min.   :0.00000  
 1st Qu.: 23996          1st Qu.: 2998.1   1st Qu.:14.00   1st Qu.:0.00000  
 Median : 41110          Median : 6159.8   Median :26.00   Median :0.00000  
 Mean   : 41115          Mean   : 7723.8   Mean   :30.67   Mean   :0.01493  
 3rd Qu.: 53949          3rd Qu.:10854.2   3rd Qu.:46.00   3rd Qu.:0.00000  
 Max.   :110554          Max.   :48167.3   Max.   :96.00   Max.   :1.00000  
   month_sold     structure_quality
 Min.   : 1.000   Min.   :1.000    
 1st Qu.: 4.000   1st Qu.:2.000    
 Median : 7.000   Median :4.000    
 Mean   : 6.656   Mean   :3.514    
 3rd Qu.: 9.000   3rd Qu.:4.000    
 Max.   :12.000   Max.   :5.000    

At a glance, I do not care about summary statistics for longitude and latitude. Sale price is worth noting with a minimum of $72,000 and a max of $2.6 million. There are some pretty old homes in the data set with the most at almost 100 years old. A minimum age of zero may be something to watch out for because it is not clear as to if the home is less than 1 year old. I must also find out how the distance is measured. I believe distance in this data set is measured in feet so we can see there may be a few water front properties that I will guess fetch a high sale price. Homes farther away may be cheaper if they are not located near another desirable location.

While typically a negative, the distance to highway can be viewed as a positive trait to a home. For the sake of this analysis and to go with typical association with close proximity to a highway, I will view closer distance as a negative.

Another note, is I will have one linear model, m1, to act as the model with all of the variables in it. I want it as a baseline model to see the factor all of the variables play in house price. Since multicollinearity may be a factor, I will not use this model too much.

Code
m1 <- lm(sale_price ~ ., data = miami_housing)

Hypothesis Testing

Code
t.test(miami_housing$sale_price)

    One Sample t-test

data:  miami_housing$sale_price
t = 148.82, df = 13931, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 394674.1 405209.8
sample estimates:
mean of x 
 399941.9 

Checking for Multicollinearity

I want to run a correlation matrix for some of the variables so I can see if there will be issues of multicollinearity. This is checking for multicollinearity for the distance_to variables

Code
ggpairs(miami_housing, columns = 8:13)

Thankfully, it looks like I only have to keep note of a few variables that look to be highly correlated with one another. Distance to business center and subcenter must be in the same area, or at least close enough because 0.766 seems a bit too correlated. Ocean distance and water distance are worth noting, but not to the same degree as the prior two variables since 0.49 is just below what I would consider significant.

Simple Linear Regression

By running a few linear models, I am getting closer to the conclusion that there is no one inidicator of sale price, but multiple. Multiple linear regression models will have to be run in order to make a sound conclusion as to what is the most significant indicator of price for single family homes.

Code
m2 <- lm(sale_price ~ dist_2_ocean, data = miami_housing)
Code
m3 <- lm(sale_price ~ home_age, data = miami_housing)
Code
m4 <- lm(sale_price ~ land_sqfoot, data = miami_housing)
Code
summary(lm(sale_price ~ longitude*latitude, data = miami_housing))

Call:
lm(formula = sale_price ~ longitude * latitude, data = miami_housing)

Residuals:
    Min      1Q  Median      3Q     Max 
-471277 -166378  -60079   73821 2365411 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.850e+10  4.998e+08   37.02   <2e-16 ***
longitude           2.301e+08  6.220e+06   37.00   <2e-16 ***
latitude           -7.139e+08  1.939e+07  -36.83   <2e-16 ***
longitude:latitude -8.880e+06  2.413e+05  -36.81   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 294200 on 13928 degrees of freedom
Multiple R-squared:  0.1399,    Adjusted R-squared:  0.1397 
F-statistic:   755 on 3 and 13928 DF,  p-value: < 2.2e-16

According to this model with longitude and latitude as interaction terms, they alone are not too significant in the prediction of house prices.

Code
stargazer(m2, m3, m4, type = "text", ci.level = 0.9)

=============================================================================
                                             Dependent variable:             
                                 --------------------------------------------
                                                  sale_price                 
                                      (1)            (2)            (3)      
-----------------------------------------------------------------------------
dist_2_ocean                       -4.952***                                 
                                    (0.147)                                  
                                                                             
home_age                                        -1,850.640***                
                                                  (126.087)                  
                                                                             
land_sqfoot                                                      18.974***   
                                                                  (0.413)    
                                                                             
Constant                         556,875.800*** 456,699.700*** 236,370.100***
                                  (5,323.914)    (4,697.537)    (4,349.770)  
                                                                             
-----------------------------------------------------------------------------
Observations                         13,932         13,932         13,932    
R2                                   0.075          0.015          0.132     
Adjusted R2                          0.075          0.015          0.132     
Residual Std. Error (df = 13930)  305,024.700    314,801.200    295,578.300  
F Statistic (df = 1; 13930)       1,136.729***    215.427***    2,115.150*** 
=============================================================================
Note:                                             *p<0.1; **p<0.05; ***p<0.01

Visualization

I don’t love looking at the notation for sale price, but for now it will do. This visualization of price and distance to ocean is about how I expected it would go. We can see homes closer to the ocean go for the most amount of money, but at the ~40,000 and ~60,000ft distance, there is a spike. This I would like to figure out.

Code
ggplot(miami_housing, aes(x = dist_2_ocean, y = sale_price))+
  geom_point()+
  geom_smooth(method = lm)
`geom_smooth()` using formula 'y ~ x'

I use mapview here just to get a birds-eye view of the entire Miami area.

Code
mapview(miami_housing, xcol= "longitude", ycol = "latitude", legend = mapviewGetOption("sale_price"),crs = 4269, grid = FALSE)

Multiple Linear Regression

Code
m5 <- lm(sale_price ~ land_sqfoot + dist_2_biz_center*dist_2_ocean + dist_2_hiway + structure_quality*home_age, data = miami_housing)

#summary(m5)

This first model I ran is a good start. An adjusted R-squared is quite high when I compare it with the first model. Every variable is considered significant here, even the interaction terms. I chose to have distance to ocean and business center interact because I deemed those two as the most important distance variables and I wanted them both to be taken into account together. Home age and structure quality also interact because typically a home that is older will have a poorer structure. Let me check this to see if they are correlated.

Code
cor.test(miami_housing$home_age, miami_housing$structure_quality, method = "kendall")

    Kendall's rank correlation tau

data:  miami_housing$home_age and miami_housing$structure_quality
z = 4.7869, p-value = 1.694e-06
alternative hypothesis: true tau is not equal to 0
sample estimates:
       tau 
0.03187774 

There appears to be no correlation between structure and age, which is interesting because their reaction is significant according to the model. This model is a good starting place to see what factors the most in sale price.

Code
m6 <- lm(sale_price ~ land_sqfoot + floor_sqfoot + dist_2_nearest_water*dist_2_biz_center*dis_2_nearest_subcenter + dist_2_hiway, data = miami_housing)

#summary(m6)

This model seems to fit better than the previous one with an adjusted R squared of 0.6. The F statistic is higher than that of the previous model which I will take note of later. First I want to run other models and see how they compare.

Code
m7 <- lm(log(sale_price) ~ land_sqfoot*floor_sqfoot + dist_2_nearest_water*dist_2_biz_center*dis_2_nearest_subcenter + dist_2_hiway + special_features, data = miami_housing)

summary(m7)

Call:
lm(formula = log(sale_price) ~ land_sqfoot * floor_sqfoot + dist_2_nearest_water * 
    dist_2_biz_center * dis_2_nearest_subcenter + dist_2_hiway + 
    special_features, data = miami_housing)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.50805 -0.16030  0.01149  0.18193  1.79821 

Coefficients:
                                                                 Estimate
(Intercept)                                                     1.230e+01
land_sqfoot                                                     2.202e-05
floor_sqfoot                                                    5.204e-04
dist_2_nearest_water                                           -3.554e-05
dist_2_biz_center                                              -9.197e-06
dis_2_nearest_subcenter                                        -1.658e-05
dist_2_hiway                                                    1.264e-05
special_features                                                5.748e-06
land_sqfoot:floor_sqfoot                                       -7.706e-09
dist_2_nearest_water:dist_2_biz_center                          5.306e-10
dist_2_nearest_water:dis_2_nearest_subcenter                    3.729e-10
dist_2_biz_center:dis_2_nearest_subcenter                       1.643e-10
dist_2_nearest_water:dist_2_biz_center:dis_2_nearest_subcenter -7.737e-15
                                                               Std. Error
(Intercept)                                                     2.228e-02
land_sqfoot                                                     1.209e-06
floor_sqfoot                                                    5.704e-06
dist_2_nearest_water                                            2.334e-06
dist_2_biz_center                                               3.907e-07
dis_2_nearest_subcenter                                         5.326e-07
dist_2_hiway                                                    5.748e-07
special_features                                                2.266e-07
land_sqfoot:floor_sqfoot                                        3.591e-10
dist_2_nearest_water:dist_2_biz_center                          3.258e-11
dist_2_nearest_water:dis_2_nearest_subcenter                    5.504e-11
dist_2_biz_center:dis_2_nearest_subcenter                       8.382e-12
dist_2_nearest_water:dist_2_biz_center:dis_2_nearest_subcenter  5.147e-16
                                                               t value Pr(>|t|)
(Intercept)                                                    552.072  < 2e-16
land_sqfoot                                                     18.218  < 2e-16
floor_sqfoot                                                    91.234  < 2e-16
dist_2_nearest_water                                           -15.228  < 2e-16
dist_2_biz_center                                              -23.537  < 2e-16
dis_2_nearest_subcenter                                        -31.129  < 2e-16
dist_2_hiway                                                    21.993  < 2e-16
special_features                                                25.369  < 2e-16
land_sqfoot:floor_sqfoot                                       -21.460  < 2e-16
dist_2_nearest_water:dist_2_biz_center                          16.288  < 2e-16
dist_2_nearest_water:dis_2_nearest_subcenter                     6.775 1.29e-11
dist_2_biz_center:dis_2_nearest_subcenter                       19.607  < 2e-16
dist_2_nearest_water:dist_2_biz_center:dis_2_nearest_subcenter -15.034  < 2e-16
                                                                  
(Intercept)                                                    ***
land_sqfoot                                                    ***
floor_sqfoot                                                   ***
dist_2_nearest_water                                           ***
dist_2_biz_center                                              ***
dis_2_nearest_subcenter                                        ***
dist_2_hiway                                                   ***
special_features                                               ***
land_sqfoot:floor_sqfoot                                       ***
dist_2_nearest_water:dist_2_biz_center                         ***
dist_2_nearest_water:dis_2_nearest_subcenter                   ***
dist_2_biz_center:dis_2_nearest_subcenter                      ***
dist_2_nearest_water:dist_2_biz_center:dis_2_nearest_subcenter ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.308 on 13919 degrees of freedom
Multiple R-squared:  0.7059,    Adjusted R-squared:  0.7057 
F-statistic:  2784 on 12 and 13919 DF,  p-value: < 2.2e-16
Code
m7.1 <- lm(sale_price ~ land_sqfoot*floor_sqfoot + dist_2_nearest_water*dist_2_biz_center*dis_2_nearest_subcenter + dist_2_hiway + special_features, data = miami_housing)

I want to run the par function on this model to see where it’s at but to also get a sense of at least one models diagnostics.

Code
par(mfrow = c(2,3)); plot(m7, which = 1:6) #log

Code
par(mfrow = c(2,3)); plot(m7.1, which = 1:6) #unlog

I must check if the first graph is horizontal enough to meet the qualified standards. No data set is perfect so I need to confirm this is fine to keep. The data may be slightly skewed since the data is linear until the 2 mark of Theoretical Quantities. This indicates the model and data set are not normally distributed.

The fitted values and residuals graph checks for linearity and constant variance assumptions. The constant variance assumption is violated here.

I logged the sale price which seemed to correct the plots tremendously. The QQ and residuals fitted seemed to even out and fall along the lines of a normal distribution.

Initially, before I logged sales_price, the QQ plot was not as linear as I would have hoped. After logging the sales_price variable, it fixed the residuals vs fitted graph as well as the QQ plot. I created another model to compare the logged and unlogged diagnostic graphs. Currently, I do not think I will include this in the final project.